DATA 2002: Data Analytics- Learning from Data is an intermediate unit of study at the University of Sydney. The unit aims to equip students with knowledge and skills that will enable them to embrace data analytic challenges stemming from everyday problems.
This report seeks to identify a good classifier for spam vs non-spam emails and to report on its performance, both in-sample and out-of-sample.
Three methods were applied: a decision tree (extended to a random forest), a logistic regression, and a k-nearest-neighbours approach. The report finds that logistic regression is the best classifier of the approaches considered.
As part of the Semester 2, 2018 assessment for the unit, students are required to identify a good classifier for spam vs non-spam emails and report on its performance (in-sample and out-of-sample).
The data set contains 4601 messages described by 58 variables (57 predictors plus the class label); the objective is to predict whether an email is junk email, or 'spam'.
The data can be found here https://archive.ics.uci.edu/ml/datasets/spambase (which gives quite a lot of background information about the data). It is also available in the kernlab package, which is perhaps the simplest way to load the data into R:
library(lattice)
library(ggplot2)
library(caret)
library(rpart)
library(rpart.plot)
library(partykit)
library(randomForest)
library(class)
library(cvTools)
library(stargazer)
library(dplyr) # for glimpse() and the %>% pipe
data(spam, package = "kernlab")
s = spam
t = spam
glimpse(spam)
## Observations: 4,601
## Variables: 58
## $ make <dbl> 0.00, 0.21, 0.06, 0.00, 0.00, 0.00, 0.00, 0....
## $ address <dbl> 0.64, 0.28, 0.00, 0.00, 0.00, 0.00, 0.00, 0....
## $ all <dbl> 0.64, 0.50, 0.71, 0.00, 0.00, 0.00, 0.00, 0....
## $ num3d <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ our <dbl> 0.32, 0.14, 1.23, 0.63, 0.63, 1.85, 1.92, 1....
## $ over <dbl> 0.00, 0.28, 0.19, 0.00, 0.00, 0.00, 0.00, 0....
## $ remove <dbl> 0.00, 0.21, 0.19, 0.31, 0.31, 0.00, 0.00, 0....
## $ internet <dbl> 0.00, 0.07, 0.12, 0.63, 0.63, 1.85, 0.00, 1....
## $ order <dbl> 0.00, 0.00, 0.64, 0.31, 0.31, 0.00, 0.00, 0....
## $ mail <dbl> 0.00, 0.94, 0.25, 0.63, 0.63, 0.00, 0.64, 0....
## $ receive <dbl> 0.00, 0.21, 0.38, 0.31, 0.31, 0.00, 0.96, 0....
## $ will <dbl> 0.64, 0.79, 0.45, 0.31, 0.31, 0.00, 1.28, 0....
## $ people <dbl> 0.00, 0.65, 0.12, 0.31, 0.31, 0.00, 0.00, 0....
## $ report <dbl> 0.00, 0.21, 0.00, 0.00, 0.00, 0.00, 0.00, 0....
## $ addresses <dbl> 0.00, 0.14, 1.75, 0.00, 0.00, 0.00, 0.00, 0....
## $ free <dbl> 0.32, 0.14, 0.06, 0.31, 0.31, 0.00, 0.96, 0....
## $ business <dbl> 0.00, 0.07, 0.06, 0.00, 0.00, 0.00, 0.00, 0....
## $ email <dbl> 1.29, 0.28, 1.03, 0.00, 0.00, 0.00, 0.32, 0....
## $ you <dbl> 1.93, 3.47, 1.36, 3.18, 3.18, 0.00, 3.85, 0....
## $ credit <dbl> 0.00, 0.00, 0.32, 0.00, 0.00, 0.00, 0.00, 0....
## $ your <dbl> 0.96, 1.59, 0.51, 0.31, 0.31, 0.00, 0.64, 0....
## $ font <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ num000 <dbl> 0.00, 0.43, 1.16, 0.00, 0.00, 0.00, 0.00, 0....
## $ money <dbl> 0.00, 0.43, 0.06, 0.00, 0.00, 0.00, 0.00, 0....
## $ hp <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ hpl <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ george <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ num650 <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0....
## $ lab <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ labs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ telnet <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ num857 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ data <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0....
## $ num415 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ num85 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ technology <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0....
## $ num1999 <dbl> 0.00, 0.07, 0.00, 0.00, 0.00, 0.00, 0.00, 0....
## $ parts <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ pm <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ direct <dbl> 0.00, 0.00, 0.06, 0.00, 0.00, 0.00, 0.00, 0....
## $ cs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ meeting <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ original <dbl> 0.00, 0.00, 0.12, 0.00, 0.00, 0.00, 0.00, 0....
## $ project <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0....
## $ re <dbl> 0.00, 0.00, 0.06, 0.00, 0.00, 0.00, 0.00, 0....
## $ edu <dbl> 0.00, 0.00, 0.06, 0.00, 0.00, 0.00, 0.00, 0....
## $ table <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ conference <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ charSemicolon <dbl> 0.000, 0.000, 0.010, 0.000, 0.000, 0.000, 0....
## $ charRoundbracket <dbl> 0.000, 0.132, 0.143, 0.137, 0.135, 0.223, 0....
## $ charSquarebracket <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0....
## $ charExclamation <dbl> 0.778, 0.372, 0.276, 0.137, 0.135, 0.000, 0....
## $ charDollar <dbl> 0.000, 0.180, 0.184, 0.000, 0.000, 0.000, 0....
## $ charHash <dbl> 0.000, 0.048, 0.010, 0.000, 0.000, 0.000, 0....
## $ capitalAve <dbl> 3.756, 5.114, 9.821, 3.537, 3.537, 3.000, 1....
## $ capitalLong <dbl> 61, 101, 485, 40, 40, 15, 4, 11, 445, 43, 6,...
## $ capitalTotal <dbl> 278, 1028, 2259, 191, 191, 54, 112, 49, 1257...
## $ type <fct> spam, spam, spam, spam, spam, spam, spam, sp...
Fit a logistic regression model using all predictors:
glm1 = glm(type ~., family = binomial, data = spam)
Perform AIC-based backward stepwise model selection:
step.back.aic = step(glm1, direction = "backward", trace = FALSE)
stargazer::stargazer(glm1, step.back.aic, type = "html", column.labels = c("Full model","Stepwise model"))
| Dependent variable: | ||
| type | ||
| Full model | Stepwise model | |
| (1) | (2) | |
| make | -0.390* | -0.469** |
| (0.231) | (0.216) | |
| address | -0.146** | -0.137** |
| (0.069) | (0.065) | |
| all | 0.114 | |
| (0.110) | ||
| num3d | 2.252 | 2.257 |
| (1.507) | (1.507) | |
| our | 0.562*** | 0.566*** |
| (0.102) | (0.102) | |
| over | 0.883*** | 0.825*** |
| (0.250) | (0.245) | |
| remove | 2.279*** | 2.261*** |
| (0.333) | (0.327) | |
| internet | 0.570*** | 0.565*** |
| (0.168) | (0.166) | |
| order | 0.734*** | 0.668** |
| (0.285) | (0.275) | |
| mail | 0.127* | 0.116* |
| (0.073) | (0.070) | |
| receive | -0.256 | |
| (0.298) | ||
| will | -0.138* | -0.136* |
| (0.074) | (0.073) | |
| people | -0.080 | |
| (0.230) | ||
| report | 0.145 | |
| (0.136) | ||
| addresses | 1.236* | 1.293* |
| (0.725) | (0.703) | |
| free | 1.039*** | 1.048*** |
| (0.146) | (0.145) | |
| business | 0.960*** | 0.945*** |
| (0.225) | (0.221) | |
| email | 0.120 | |
| (0.117) | ||
| you | 0.081** | 0.090*** |
| (0.035) | (0.034) | |
| credit | 1.047* | 1.117** |
| (0.538) | (0.553) | |
| your | 0.242*** | 0.233*** |
| (0.052) | (0.049) | |
| font | 0.201 | 0.221 |
| (0.163) | (0.165) | |
| num000 | 2.245*** | 2.193*** |
| (0.471) | (0.467) | |
| money | 0.426*** | 0.442*** |
| (0.162) | (0.169) | |
| hp | -1.920*** | -1.981*** |
| (0.313) | (0.313) | |
| hpl | -1.040** | -1.036** |
| (0.440) | (0.440) | |
| george | -11.767*** | -11.220*** |
| (2.113) | (1.795) | |
| num650 | 0.445** | 0.418** |
| (0.199) | (0.199) | |
| lab | -2.486* | -2.525* |
| (1.502) | (1.525) | |
| labs | -0.330 | |
| (0.314) | ||
| telnet | -0.170 | |
| (0.482) | ||
| num857 | 2.549 | |
| (3.283) | ||
| data | -0.738** | -0.730** |
| (0.312) | (0.308) | |
| num415 | 0.668 | |
| (1.601) | ||
| num85 | -2.055*** | -2.137*** |
| (0.788) | (0.783) | |
| technology | 0.924*** | 0.964*** |
| (0.309) | (0.309) | |
| num1999 | 0.047 | |
| (0.175) | ||
| parts | -0.597 | -0.606 |
| (0.423) | (0.427) | |
| pm | -0.865** | -0.867** |
| (0.383) | (0.383) | |
| direct | -0.305 | |
| (0.364) | ||
| cs | -45.048* | -44.200* |
| (26.598) | (26.427) | |
| meeting | -2.689*** | -2.690*** |
| (0.838) | (0.845) | |
| original | -1.247 | -1.274 |
| (0.806) | (0.823) | |
| project | -1.573*** | -1.619*** |
| (0.529) | (0.535) | |
| re | -0.792*** | -0.796*** |
| (0.156) | (0.155) | |
| edu | -1.459*** | -1.466*** |
| (0.269) | (0.268) | |
| table | -2.326 | -2.356 |
| (1.659) | (1.793) | |
| conference | -4.016** | -4.033*** |
| (1.611) | (1.564) | |
| charSemicolon | -1.291*** | -1.309*** |
| (0.442) | (0.447) | |
| charRoundbracket | -0.188 | |
| (0.249) | ||
| charSquarebracket | -0.657 | |
| (0.838) | ||
| charExclamation | 0.347*** | 0.359*** |
| (0.089) | (0.091) | |
| charDollar | 5.336*** | 5.481*** |
| (0.706) | (0.706) | |
| charHash | 2.403** | 2.202** |
| (1.113) | (1.073) | |
| capitalAve | 0.012 | |
| (0.019) | ||
| capitalLong | 0.009*** | 0.010*** |
| (0.003) | (0.002) | |
| capitalTotal | 0.001*** | 0.001*** |
| (0.0002) | (0.0002) | |
| Constant | -1.569*** | -1.552*** |
| (0.142) | (0.128) | |
| Observations | 4,601 | 4,601 |
| Log Likelihood | -907.883 | -912.438 |
| Akaike Inf. Crit. | 1,931.765 | 1,912.876 |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | |
Generate a confusion matrix to assess the in-sample accuracy of the predictions from the stepwise model:
preds = ifelse(predict(step.back.aic, type = "response") >= 0.5, "spam", "nonspam")
preds = factor(preds, levels = levels(spam$type))
truth = as.factor(spam$type)
confusionMatrix(data = preds, reference = truth)
## Confusion Matrix and Statistics
##
## Reference
## Prediction nonspam spam
## nonspam 2669 193
## spam 119 1620
##
## Accuracy : 0.9322
## 95% CI : (0.9245, 0.9393)
## No Information Rate : 0.606
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.857
## Mcnemar's Test P-Value : 3.584e-05
##
## Sensitivity : 0.9573
## Specificity : 0.8935
## Pos Pred Value : 0.9326
## Neg Pred Value : 0.9316
## Prevalence : 0.6060
## Detection Rate : 0.5801
## Detection Prevalence : 0.6220
## Balanced Accuracy : 0.9254
##
## 'Positive' Class : nonspam
##
Therefore the model predicts the correct category for 93.2% of emails. It misclassifies 4.3% of non-spam emails as spam (119 of the 2788 non-spam emails).
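To make explicit what the two headline figures measure, they can be recomputed directly from the confusion matrix counts above (a base-R sketch; the counts are copied from the output rather than recomputed from the model):

```r
# Counts from the confusion matrix above
nonspam_correct <- 2669  # nonspam predicted nonspam
nonspam_wrong   <- 119   # nonspam predicted spam
spam_correct    <- 1620  # spam predicted spam
spam_wrong      <- 193   # spam predicted nonspam

total    <- nonspam_correct + nonspam_wrong + spam_correct + spam_wrong
accuracy <- (nonspam_correct + spam_correct) / total        # share of all emails classified correctly
fp_rate  <- nonspam_wrong / (nonspam_correct + nonspam_wrong)  # share of nonspam flagged as spam

round(accuracy, 3)  # 0.932
round(fp_rate, 3)   # 0.043
```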
Perform 5-fold cross-validation to estimate the out-of-sample accuracy of the stepwise model.
set.seed(1)
spam$pred[step.back.aic$fitted.values >= 0.5] = "spam"
spam$pred[step.back.aic$fitted.values < 0.5] = "nonspam"
a = table(spam$pred, spam$type)
table(spam$type)[1] / dim(spam)[1]
## nonspam
## 0.6059552
mean(spam$type == spam$pred)
## [1] 0.9321887
a[2, 1] / (sum(a[, 1]))
## [1] 0.04268293
b=train(step.back.aic$formula,
data = spam,
method = "glm",
family = "binomial",
trControl = trainControl(
method = "cv", number = 5,
verboseIter = FALSE
))
b
## Generalized Linear Model
##
## 4601 samples
## 43 predictor
## 2 classes: 'nonspam', 'spam'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 3680, 3680, 3682, 3680, 3682
## Resampling results:
##
## Accuracy Kappa
## 0.9297998 0.8520847
The out-of-sample accuracy is about 93.0%, only very slightly lower than the in-sample accuracy of 93.2%.
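caret performs the resampling internally; the mechanics of 5-fold cross-validation can be sketched in base R on simulated data (the variable names and simulated outcome here are illustrative, not from the report):

```r
# Minimal sketch of 5-fold CV for a logistic classifier on simulated data
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(2 * x))           # outcome depends on x
d <- data.frame(x = x, y = y)

folds <- sample(rep(1:5, length.out = n))  # random fold assignment
acc <- numeric(5)
for (k in 1:5) {
  # fit on four folds, predict on the held-out fold
  fit    <- glm(y ~ x, family = binomial, data = d[folds != k, ])
  p      <- predict(fit, newdata = d[folds == k, ], type = "response")
  acc[k] <- mean((p >= 0.5) == (d$y[folds == k] == 1))
}
mean(acc)  # average held-out accuracy, analogous to caret's CV estimate
```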
Create a tree classifier using the default complexity parameter of 0.01 (1%):
tree = rpart(factor(type) ~ ., data = s, method = "class")
Visualise the tree (two different layouts):
rpart.plot(tree)
plot(as.party(tree))
summary(tree)
## Call:
## rpart(formula = factor(type) ~ ., data = s, method = "class")
## n= 4601
##
## CP nsplit rel error xerror xstd
## 1 0.47655819 0 1.0000000 1.0000000 0.01828190
## 2 0.14892443 1 0.5234418 0.5499173 0.01541402
## 3 0.04302261 2 0.3745174 0.4473249 0.01425628
## 4 0.03088803 4 0.2884721 0.3331495 0.01263460
## 5 0.01047987 5 0.2575841 0.2923331 0.01194441
## 6 0.01000000 6 0.2471042 0.2719250 0.01157217
##
## Variable importance
## charDollar remove num000 money
## 29 13 10 9
## charExclamation capitalLong credit order
## 7 7 5 4
## hp capitalTotal hpl capitalAve
## 4 3 2 1
## your free charRoundbracket our
## 1 1 1 1
## telnet
## 1
##
## Node number 1: 4601 observations, complexity param=0.4765582
## predicted class=nonspam expected loss=0.3940448 P(node) =1
## class counts: 2788 1813
## probabilities: 0.606 0.394
## left son=2 (3471 obs) right son=3 (1130 obs)
## Primary splits:
## charDollar < 0.0555 to the left, improve=714.1697, (0 missing)
## charExclamation < 0.0795 to the left, improve=711.9638, (0 missing)
## remove < 0.01 to the left, improve=597.8504, (0 missing)
## free < 0.095 to the left, improve=559.6634, (0 missing)
## your < 0.605 to the left, improve=543.2496, (0 missing)
## Surrogate splits:
## num000 < 0.055 to the left, agree=0.839, adj=0.346, (0 split)
## money < 0.045 to the left, agree=0.833, adj=0.321, (0 split)
## credit < 0.025 to the left, agree=0.796, adj=0.169, (0 split)
## capitalLong < 71.5 to the left, agree=0.793, adj=0.158, (0 split)
## order < 0.18 to the left, agree=0.792, adj=0.155, (0 split)
##
## Node number 2: 3471 observations, complexity param=0.1489244
## predicted class=nonspam expected loss=0.2350908 P(node) =0.7544012
## class counts: 2655 816
## probabilities: 0.765 0.235
## left son=4 (3141 obs) right son=5 (330 obs)
## Primary splits:
## remove < 0.055 to the left, improve=331.3223, (0 missing)
## charExclamation < 0.0915 to the left, improve=284.6134, (0 missing)
## free < 0.135 to the left, improve=266.0164, (0 missing)
## your < 0.615 to the left, improve=165.9929, (0 missing)
## capitalAve < 3.6835 to the left, improve=158.6464, (0 missing)
## Surrogate splits:
## capitalLong < 131.5 to the left, agree=0.909, adj=0.045, (0 split)
## charHash < 0.8325 to the left, agree=0.906, adj=0.012, (0 split)
## num3d < 7.125 to the left, agree=0.906, adj=0.009, (0 split)
## business < 4.325 to the left, agree=0.906, adj=0.009, (0 split)
## credit < 1.635 to the left, agree=0.906, adj=0.006, (0 split)
##
## Node number 3: 1130 observations, complexity param=0.03088803
## predicted class=spam expected loss=0.1176991 P(node) =0.2455988
## class counts: 133 997
## probabilities: 0.118 0.882
## left son=6 (70 obs) right son=7 (1060 obs)
## Primary splits:
## hp < 0.4 to the right, improve=91.33732, (0 missing)
## hpl < 0.12 to the right, improve=44.47552, (0 missing)
## charExclamation < 0.0495 to the left, improve=40.43106, (0 missing)
## num1999 < 0.085 to the right, improve=35.90036, (0 missing)
## george < 0.21 to the right, improve=34.65602, (0 missing)
## Surrogate splits:
## hpl < 0.31 to the right, agree=0.965, adj=0.429, (0 split)
## telnet < 0.045 to the right, agree=0.950, adj=0.186, (0 split)
## num650 < 0.025 to the right, agree=0.946, adj=0.129, (0 split)
## george < 0.225 to the right, agree=0.945, adj=0.114, (0 split)
## lab < 0.08 to the right, agree=0.945, adj=0.114, (0 split)
##
## Node number 4: 3141 observations, complexity param=0.04302261
## predicted class=nonspam expected loss=0.1642789 P(node) =0.6826777
## class counts: 2625 516
## probabilities: 0.836 0.164
## left son=8 (2737 obs) right son=9 (404 obs)
## Primary splits:
## charExclamation < 0.378 to the left, improve=173.25510, (0 missing)
## free < 0.2 to the left, improve=152.11900, (0 missing)
## capitalAve < 3.638 to the left, improve= 79.00492, (0 missing)
## your < 0.865 to the left, improve= 69.83959, (0 missing)
## hp < 0.025 to the right, improve= 64.00030, (0 missing)
## Surrogate splits:
## num000 < 0.62 to the left, agree=0.875, adj=0.030, (0 split)
## free < 2.415 to the left, agree=0.875, adj=0.027, (0 split)
## money < 3.305 to the left, agree=0.872, adj=0.007, (0 split)
## business < 1.305 to the left, agree=0.872, adj=0.005, (0 split)
## order < 2.335 to the left, agree=0.872, adj=0.002, (0 split)
##
## Node number 5: 330 observations
## predicted class=spam expected loss=0.09090909 P(node) =0.07172354
## class counts: 30 300
## probabilities: 0.091 0.909
##
## Node number 6: 70 observations
## predicted class=nonspam expected loss=0.1 P(node) =0.01521408
## class counts: 63 7
## probabilities: 0.900 0.100
##
## Node number 7: 1060 observations
## predicted class=spam expected loss=0.06603774 P(node) =0.2303847
## class counts: 70 990
## probabilities: 0.066 0.934
##
## Node number 8: 2737 observations
## predicted class=nonspam expected loss=0.100475 P(node) =0.5948707
## class counts: 2462 275
## probabilities: 0.900 0.100
##
## Node number 9: 404 observations, complexity param=0.04302261
## predicted class=spam expected loss=0.4034653 P(node) =0.087807
## class counts: 163 241
## probabilities: 0.403 0.597
## left son=18 (182 obs) right son=19 (222 obs)
## Primary splits:
## capitalTotal < 55.5 to the left, improve=63.99539, (0 missing)
## capitalLong < 10.5 to the left, improve=54.95790, (0 missing)
## capitalAve < 2.654 to the left, improve=53.67847, (0 missing)
## free < 0.04 to the left, improve=40.70414, (0 missing)
## our < 0.065 to the left, improve=25.38181, (0 missing)
## Surrogate splits:
## capitalLong < 12.5 to the left, agree=0.856, adj=0.681, (0 split)
## capitalAve < 2.805 to the left, agree=0.757, adj=0.462, (0 split)
## your < 0.115 to the left, agree=0.738, adj=0.418, (0 split)
## charRoundbracket < 0.008 to the left, agree=0.693, adj=0.319, (0 split)
## our < 0.065 to the left, agree=0.673, adj=0.275, (0 split)
##
## Node number 18: 182 observations, complexity param=0.01047987
## predicted class=nonspam expected loss=0.2857143 P(node) =0.03955662
## class counts: 130 52
## probabilities: 0.714 0.286
## left son=36 (161 obs) right son=37 (21 obs)
## Primary splits:
## free < 0.845 to the left, improve=21.101450, (0 missing)
## capitalAve < 2.654 to the left, improve=13.432050, (0 missing)
## charExclamation < 0.8045 to the left, improve=10.648500, (0 missing)
## capitalLong < 8.5 to the left, improve= 6.991597, (0 missing)
## re < 0.23 to the right, improve= 6.714619, (0 missing)
## Surrogate splits:
## capitalAve < 3.871 to the left, agree=0.912, adj=0.238, (0 split)
## charSemicolon < 0.294 to the left, agree=0.907, adj=0.190, (0 split)
## email < 3.84 to the left, agree=0.896, adj=0.095, (0 split)
## our < 2.345 to the left, agree=0.890, adj=0.048, (0 split)
## capitalLong < 25 to the left, agree=0.890, adj=0.048, (0 split)
##
## Node number 19: 222 observations
## predicted class=spam expected loss=0.1486486 P(node) =0.04825038
## class counts: 33 189
## probabilities: 0.149 0.851
##
## Node number 36: 161 observations
## predicted class=nonspam expected loss=0.1987578 P(node) =0.03499239
## class counts: 129 32
## probabilities: 0.801 0.199
##
## Node number 37: 21 observations
## predicted class=spam expected loss=0.04761905 P(node) =0.004564225
## class counts: 1 20
## probabilities: 0.048 0.952
In-sample Performance:
type_pred = predict(tree, type = "class")
confusionMatrix(
data = type_pred,
reference = s$type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction nonspam spam
## nonspam 2654 314
## spam 134 1499
##
## Accuracy : 0.9026
## 95% CI : (0.8937, 0.911)
## No Information Rate : 0.606
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7925
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9519
## Specificity : 0.8268
## Pos Pred Value : 0.8942
## Neg Pred Value : 0.9179
## Prevalence : 0.6060
## Detection Rate : 0.5768
## Detection Prevalence : 0.6451
## Balanced Accuracy : 0.8894
##
## 'Positive' Class : nonspam
##
The in-sample accuracy of our tree is 90.3%. It misclassifies 4.8% of non-spam emails as spam (134 of the 2788 non-spam emails).
Performance Benchmarking:
table(spam$type)
##
## nonspam spam
## 2788 1813
benchmark = 1813/(2788+1813)
benchmark
## [1] 0.3940448
If the benchmark model predicts that all emails are spam, it achieves an accuracy of 39.4%. Our tree model achieves a much higher accuracy of 90.3%.
Out-of-sample Performance:
The out-of-sample performance was estimated using 10-fold cross-validation.
train(type ~ ., data = s,
method = "rpart", trControl = trainControl(method = "cv", number = 10))
## CART
##
## 4601 samples
## 57 predictor
## 2 classes: 'nonspam', 'spam'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 4141, 4141, 4140, 4140, 4141, 4141, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.04302261 0.8632898 0.7103242
## 0.14892443 0.7906923 0.5528305
## 0.47655819 0.6710266 0.1965975
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.04302261.
The CV procedure selects 0.043 (4.3%) for the complexity parameter, giving an out-of-sample accuracy of 86.3%, noticeably worse than the in-sample figure. It appears that the decision tree is over-fitting slightly, which drags down its out-of-sample performance.
tree2 = rpart(factor(type) ~ ., data = s, method = "class", control = rpart.control(cp = 0.043))
plot(as.party(tree2))
set.seed(2018)
type_pred2 = predict(tree2, type = "class")
confusionMatrix(
data = type_pred2,
reference = spam$type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction nonspam spam
## nonspam 2592 327
## spam 196 1486
##
## Accuracy : 0.8863
## 95% CI : (0.8768, 0.8954)
## No Information Rate : 0.606
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7589
## Mcnemar's Test P-Value : 1.312e-08
##
## Sensitivity : 0.9297
## Specificity : 0.8196
## Pos Pred Value : 0.8880
## Neg Pred Value : 0.8835
## Prevalence : 0.6060
## Detection Rate : 0.5634
## Detection Prevalence : 0.6344
## Balanced Accuracy : 0.8747
##
## 'Positive' Class : nonspam
##
Although this results in a lower in-sample accuracy, we believe that this tree, with the complexity parameter of 4.3% selected by the CV procedure, will suffer the least from over-fitting.
set.seed(2018)
rf = randomForest(factor(type) ~ ., data = s)
rf
##
## Call:
## randomForest(formula = factor(type) ~ ., data = s)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 4.65%
## Confusion matrix:
## nonspam spam class.error
## nonspam 2709 79 0.02833572
## spam 135 1678 0.07446222
The random forest has an out-of-bag (OOB) error rate of 4.65%, which corresponds to an OOB accuracy of 95.4%, a little better than the decision tree's accuracy.
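The 4.65% figure can be recovered directly from the OOB confusion matrix printed above: it is the share of off-diagonal (misclassified) observations.

```r
# OOB confusion matrix from the randomForest output above (rows = truth)
oob <- matrix(c(2709, 135, 79, 1678), nrow = 2,
              dimnames = list(truth = c("nonspam", "spam"),
                              pred  = c("nonspam", "spam")))

oob_error <- 1 - sum(diag(oob)) / sum(oob)  # off-diagonal share
round(oob_error, 4)   # 0.0465, i.e. OOB accuracy of 95.35%

# per-class error rates, matching the class.error column
round(oob[1, 2] / sum(oob[1, ]), 4)  # nonspam misclassified as spam
round(oob[2, 1] / sum(oob[2, ]), 4)  # spam misclassified as nonspam
```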
X = t %>% select(-type)
fitCtrl = trainControl(
method = "repeatedcv",
number = 5,
repeats = 10)
set.seed(1)
spam$pred = NULL # drop the prediction column added earlier so it is not used as a predictor
knnFit1 = train(
type ~ ., data = spam,
method = "knn",
trControl = fitCtrl)
knnFit1
## k-Nearest Neighbors
##
## 4601 samples
## 58 predictor
## 2 classes: 'nonspam', 'spam'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 10 times)
## Summary of sample sizes: 3680, 3680, 3682, 3680, 3682, 3681, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8057154 0.5913807
## 7 0.7993028 0.5773610
## 9 0.7946516 0.5672550
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
The caret package was used to choose the most appropriate k, using 5-fold cross-validation repeated 10 times. As the results above show, k = 5 is the most accurate value, with a mean accuracy of about 80.6%.
knn1 = knn(train = X, test = X, cl = spam$type, k = 5)
cm = confusionMatrix(knn1, spam$type)
cm$table
round(cm$overall["Accuracy"], 2)
This tests the in-sample performance for k = 5. From the confusion matrix of the model, the accuracy of the k-nearest-neighbours model with k = 5 is 87%.
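To make the method concrete, a k-nearest-neighbours classifier can be sketched in a few lines of base R (illustrative only; the report uses class::knn on all 57 predictors, and the helper name and toy data below are our own):

```r
# Minimal kNN: classify each test point by majority vote of its k nearest
# training points under Euclidean distance.
knn_predict <- function(train_x, train_y, test_x, k = 5) {
  apply(test_x, 1, function(row) {
    d     <- sqrt(colSums((t(train_x) - row)^2))  # distance to every training point
    votes <- train_y[order(d)[1:k]]               # labels of the k closest points
    names(which.max(table(votes)))                # majority vote
  })
}

# Toy check on two well-separated 2-D clusters
set.seed(1)
tr_x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
              matrix(rnorm(40, mean = 5), ncol = 2))
tr_y <- rep(c("a", "b"), each = 20)
te_x <- rbind(c(0, 0), c(5, 5))                   # one point at each cluster centre
knn_predict(tr_x, tr_y, te_x, k = 5)              # "a" "b"
```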
The report shows that although the random forest approach yields the highest accuracy, correctly identifying the type of email 95.4% of the time, it has an out-of-bag error rate of 4.65%. In contrast, the logistic regression approach yields a slightly lower accuracy of 93.2% but misclassifies only 4.3% of non-spam emails as spam. To put this in the context of the data, a difference of 0.35 percentage points corresponds to around 16 emails across the data set being wrongly classified.
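The "around 16 emails" figure follows from applying the 0.35 percentage-point rate difference across all 4601 emails, as the report does:

```r
# 0.35 percentage points of the 4601 emails in the data set
total_n   <- 4601
rate_diff <- 0.0465 - 0.0430   # 0.35 percentage points
round(total_n * rate_diff)     # 16
```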
A. Liaw and M. Wiener (2002). Classification and Regression by randomForest. R News 2(3), 18–22.
Alexandros Karatzoglou, Alex Smola, Kurt Hornik, Achim Zeileis (2004). kernlab - An S4 Package for Kernel Methods in R. Journal of Statistical Software 11(9), 1-20. URL http://www.jstatsoft.org/v11/i09/
Andreas Alfons (2012). cvTools: Cross-validation tools for regression models. R package version 0.3.2. https://CRAN.R-project.org/package=cvTools
Garrett Grolemund, Hadley Wickham (2011). Dates and Times Made Easy with lubridate. Journal of Statistical Software, 40(3), 1-25. URL http://www.jstatsoft.org/v40/i03/.
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
Hadley Wickham (2017). tidyverse: Easily Install and Load the ‘Tidyverse’. R package version 1.2.1. https://CRAN.R-project.org/package=tidyverse
Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables. R package version 5.2.1. https://CRAN.R-project.org/package=stargazer
Max Kuhn. Contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer, Brenton Kenkel, the R Core Team, Michael Benesty, Reynald Lescarbeau, Andrew Ziem, Luca Scrucca, Yuan Tang, Can Candan and Tyler Hunt. (2018). caret: Classification and Regression Training. R package version 6.0-80. https://CRAN.R-project.org/package=caret
Sarkar, Deepayan (2008) Lattice: Multivariate Data Visualization with R. Springer, New York. ISBN 978-0-387-75968-5
Stephen Milborrow (2018). rpart.plot: Plot ‘rpart’ Models: An Enhanced Version of ‘plot.rpart’. R package version 3.0.4. https://CRAN.R-project.org/package=rpart.plot
Terry Therneau and Beth Atkinson (2018). rpart: Recursive Partitioning and Regression Trees. R package version 4.1-13. https://CRAN.R-project.org/package=rpart
Torsten Hothorn, Achim Zeileis (2015). partykit: A Modular Toolkit for Recursive Partytioning in R. Journal of Machine Learning Research, 16, 3905-3909. URL http://jmlr.org/papers/v16/hothorn15a.html
Torsten Hothorn, Kurt Hornik and Achim Zeileis (2006). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15(3), 651–674.
Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0